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This thesis is concerned with the implementation of an adaptive identification 
algorithm using parallel processing and systolic arrays. In particular, discrete samples 
of input and output data of a system with uncertain characteristics are used to 
determine the parameters of its model. The identification algorithm is based on 
recursive least squares, QR decomposition, and block processing techniques with 
covariance resetting. Along similar lines as previous approaches, the identification 
process is based on the use of Givens rotations. This approach uses the Cordic 
technique for improved numerical efficiency in performing the rotations. Additionally, 
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I. INTRODUCTION 



A. BACKGROUND 

The problem of adaptively controlling systems with uncertain characteristics 
depends largely on identification of the unknown system parameters. The ability to 
accurately and quickly estimate these parameters, then, is of primary importance. The 
state of computer development today has made this estimation possible in real time. 

The goal of parameter estimation is to best fit an appropriate model to the input- 
output data of the plant under investigation. This immediately leaves us with two 
basic and distinct problems: the choice of a parameter model and the choice of an 
estimation algorithm. 

We desire to select a parameter model which relates input and output data by 
means of weighted parameters, in the form 

>'(t) = <P T (t) 0 + v(t) (1.1) 

with y(t) and <p T (t) representing the output and signals respectively available for 
measurements, and where 0 is an array of unknown parameters to be determined. The 
term v(t) represents noise or other modeling errors. Though numerous choices of 
model structures exist, linear models still remain the most desirable due to their 
simplicity and the considerable amount of theory developed to analyze them. 

Secondly, we need to choose an appropriate estimation algorithm. Again, several 
possibilities exist, but the most effective in terms of converging is the recursive least 
squares algorithm [Ref. 1: pp. 49-68]. This algorithm is implementable using on line 
matrix manipulations, and the technique is based upon on line solution of a system of 
linear equations. 

The major drawback of recursive least squares is the computational complexity 
involved. The size of the matrices grows with the complexity of the system to be 
modeled. Although available microprocessors are effective for low order systems and 
slow' sampling rates, more complex problem require improved capabilities. These 
capabilities are provided by systolic arrays built using VLSI technology. 
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This research analyzes the on line recursive least squares identification algorithm. 
This algorithm processes blocks of data, using values gained from the previous block as 
initializing values, in a method known as block processing. This block processing 
technique is based on QR decomposition, discussed in the next chapter. 

B. IMPLEMENTATION CONSIDERATIONS 

The particular computing structure we are considering is based on the systolic 
array (or wavefront array). The systolic array consists of an array of individual 
processing cells, each provided with a local memory and processing unit of its own, 
connected in such a way that each cell communicates only with its nearest neighbors. 
The array is designed so that data is continuously clocked or pumped throughout in a 
rhythmic fashion; hence the name "systolic." The cells are simple in that they are 
required to perform only basic mathematical functions on the data received from 
neighboring cells. Special purpose hardware incorporating systolic arrays can be built 
using VLSI technology. The advantage with this technique is the fact that a complex 
operation is performed by several processors at a time, thus increasing the throughput 
of data. 

Previous authors [Refs. 2,3: pp.29 - 37, pp.255 - 274] have presented parameter 
estimation algorithms using systolic arrays. The general idea has been to solve a 
system of linear equations in two stages: first, by triangularization of the matrix of 
coefficients, and second, solving by successive substitution. The previous algorithms 
have all been similar in that they used a triangular systolic array to triangularize the 
matrix, and a linear systolic array configuration to solve for the parameters. It turns 
out that the arrays for the triangular and linear sections are different, and furthermore 
the linear section requires operations such as divisions which are hardly implementable 
by simple processor operations. 

In this thesis we present an algorithm which is based on two identical triangular 
sections. It is characterized by the fact that only orthogonal operations are involved, 
thus making the algorithm numerically more stable, and easily implementable by simple 
shift and add operations. Also, one of the advantages is that only two different types 
of cells are necessary (as compared to four in the previous implementations), resulting 
in a simplification in the manufacturing process. These cells perform different 
functions from those used by the above referenced authors. The Cordic technique is 
used by the cells to perform vector rotations, resulting in improved numerical 
efficiency. Though more total cells are required in this implementation, the cost of 
additional cells in a VLSI scheme is considered to be minimal. 
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The use of fixed point versus floating point arithmetic is considered during this 
investigation. Because fixed point operations are based on simple shift functions and 
finite registers, which are simple to implement, it seemed advantageous to use fixed 
point values. However, since input and output data do not naturally appear as integer 
values, there was concern over loss of accuracy due to necessary scaling and 
truncation. With proper choice of scaling factors, it will be shown that the integer 
methods perform as well as floating point. 

This thesis is divided as follows: Chapter II discusses the methods of system 
identification, i.e. solution of systems of linear equations, QR decomposition, and 
recursive least squares; Chapter III discusses the Cordic technique and the 
implementation of the systolic arrays. Chapter IV discusses the simulation results, and 
Chapter V draws the final conclusions. A listing of the program used to simulate the 
systolic arrays is found in the appendix. 
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II. SYSTEM IDENTIFICATION METHODS 



A. LINEAR SYSTEM MODELING 

The first step in any parameter identification process is to model the system in 
mathematical terms. To this end, consider a general system with a single input u(t) 
and single output y(t), as shown in Figure 2.1. 




y(t) + a i> r (t-l) + ... + a n y(t-n) = b^t-l) + ... + b m u(t-rn) + v(t) (2.1) 

where the aj's and the bj's are real constants, and the equation (and system) is of n 1 * 1 
order. The v(t) term denotes a noise variable. Equation (2.1) represents a class of 
discrete systems, known as recursive systems because the output depends not only on 
the input but on the previous output values also. 

An alternative way to represent the above equation is by defining the parameter 
vector as 



® l^]> *** > a n’ ’^nJ 

and the regression vector of input-output data as 
<P T (0 = l *y(M) . - ,-y(t-n), u(t-l), ... , u(t-m)] 



( 2 . 2 ) 



(2.3) 
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Now, using (2.2) and (2.3), we can rewrite (2.1) as 



y(t) = (p T (t) 9 + v(t) 



(2.4) 



The vector 0 contains the parameters which we want to determine. This can be 
obtained by sampling the system at some sample frequency (say, 0) $ ) and accumulating 
a numerical sequence that describes the system at discrete time intervals. We can 
express this in the form of a system of linear equations by considering the sequence at 
several instants of time. Ideally, if the number of samples of u(t) and y(t) (i.e. the 
number of equations) equals the number of unknowns (m + n), we should be able to 
solve exactly for 0. However, it is normally the case that the number of equations is 
greater than the number of unknowns. This system may or may not have a solution. 
Ideally, in the noiseless case (v(t) = 0) a solution exists, regardless of the number of 
equations. However, since noise and numerical errors are present, we look for the least 
squares solution of the system of equations, i.e. for the solution which minimizes the 
error 



This always exists, although it might not be unique. 

B. SOLUTION OF SYSTEMS OF LINEAR EQUATIONS 

In the previous section we saw how the linear system could be modeled by (2.4). 
When numerous samples are taken, we end up with a system of linear equations, such 
as 



£(0) = II A0 - b II 2 



(2.5) 



y(t) - <p T (t) 0 + v(t) 
y(t-l) = <p T (t-l)0 + v(t-l) 

y(t-2) = <p T (t-2) 0 + v(t-2) 



This system of equations can be expressed in the form 



A0 = b 



( 2 . 6 ) 
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where A 6 ^? mxn , 0 6 , ]2.e 3% m . Now, when m = n, and A is full rank and 

invertible, we can solve uniquely for 0 as 



However, in general we find that m > n (i.e. more equations than unknowns). In this 
instance, the solution of (2. 7) is defined in the least squares sense as 



Although (2.8) can be solved by pseudoinverse, this technique is not attractive due to 
the presence of matrix inversion. An alternative way is to triangularize A in (2.6) and 
then solve for 0 by successive back substitutions. This, of course, is the basis of 
Gaussian elimination techniques. 

But now w’e face the dilemma of triangularizing an array when m > n. The 
solution is to triangularize the upper part of the A matrix, leaving one or more rows of 
zeros at the bottom as a result. This will permit us to solve for 0 in the least squares 
sense, as indicated in (2.8). This is known as QR decomposition. (For review of least 
squares estimation in statistical theory, the reader is referred to the literature.) 

C. QR DECOMPOSITION 

A means to solve for the least-squares solution of a system of linear equations is 
provided by the QR decomposition of a matrix [Ref. 4: pp. 209-215]. Consider again 
Equation (2.6) with m > n, where 0 = 0 s (i.e. least squares sense). We can factor the 
m x n matrix A as 



0 = A' l h 



(2.7) 



0 S where ||A0 S - b|| = min 0 ||A0 - b|| 



( 2 . 8 ) 



A = QR 
x “ « 



(2.9) 



where Q is a m * n orthogonal matrix such that QQ T = I, and £ is defined as 



R 



( 2 . 10 ) 



R = 



0 
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where R is an n * n upper triangular matrix, and 0 represents a zero matrix. Now we 
can rewrite (2.6) as 



R 


0 = Q T b = 


0x 


0 
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And finally, from (2.11), it follows that 

R0 = P, (2.12) 

Now, the solution of (2.6) in the least squares sense is the same as the solution of 
n equations and n unknowns in (2.12). 

Proof. Consider the set of equations 

e = A0 - b = QR0 - b 

where e represents the least squares error vector. Then, 

e T e = (b T - 0 t A t ) (A0 - h) 

Q T e = R0 - Q T b = R0 - p 

Then, we have 

e T e = e T QQ T e = || R0 - p || 2 



II 




R 


0 - 


Pi 








‘o’ 




t 2 





R0 * P[ 2 

*P 2 

- II R0 * Pi II 2 + II P 2 II 2 

We see that P 2 is independent of 0. Therefore, ||e|| 2 is minimized when 
0,, - R*‘Pi or R0, s =■ pj ♦ 
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Based upon these results, we see that we can now solve (2.12) for 0 by back 
substitution. 

There are a number of methods available that can be used to compute the upper 
triangular matrix R. In particular, the Givens rotation (Ref. 5: pp. 497 - 499] is 
attractive for systolic array implementation since it is based on manipulation of pairs 
of adjacent rows, thus requiring minimum broadcasting of data. Cordic techniques are 
used to implement these rotations, and are discussed in the next chapter. 

D. RECURSIVE LEAST SQUARES 

The problem now is to determine a recursive algorithm that can be used to 
estimate the parameter 0 in the model 

y(t) - <p T (t)0 (2.13) 

We desire to estimate 0 from a set of measured data y(t) and <p(t). In the least squares 
sense we choose this estimate by best fitting our model to the available data. That is, 
we wish to minimize £, where 



|€| = |y(t) - (p T (t) 0| (2.14) 

A A 

0 refers to the estimate of 0 at any time t. In particular 0 t satisfies the equations 

A(I-l)#, = y(l-l) ' (2 ' 5) 

It is possible to. show that 0 t can be recursively computed as (Ref. 6: pp. 17 - 22] 

(2.16) 



9 T (!-1) 




y(i-l) 


• 


e t = 


* 


* T (0) 




y(o) 



fl -5 , mu 9(0 (y(0 - 8, T <p(0) 
fl .+i - ®. + p ('-» , + /( 



l (t) P(t-l) (p(t) 
where P(t) = (A(t) T A(t))‘ l satisfies the recursion 

<p(t) <p T (t) P(t-l) 



P(t) = P(t-l) - P(t-l) 



1 + <p T (t) P(t-l) (p(t) 



(2.17) 
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To this point, we have ignored the problem of the existence of the solution. If 
A(t) is singular, then P(t) does not exist. Furthermore, we can not initialize A(-l) = 0 
because this would leave P(-l) undefined. The solution is to incorporate initial 
conditions into A(t) so that P(t) can be computed at each t. Choose A( ; l) = <r Q I, 
where <t q > 0 is some arbitrary constant, and I is the identity matrix of appropriate 
dimensions. Then, by algebraic manipulation, Equation (2.15) becomes 



0 T W-1) 




>•(<-1) 


* T (o) 


e t = 


y(’o) 


— 




— — — — 


A(-l) 


• 


A(-l)^ 0 



(2.18) 



The solution to (2.18) can be made fully recursive using (2.16) and (2.17) and 
appropriate initial conditions. 

E. BLOCK PROCESSING AND COVARIANCE RESETTING 

In the previous section, an algorithm was described that allow's us to estimate the 
parameter 0 recursively. However, note that from the definition of P(t) (also known as 
the covariance matrix) 



P(t) = (A T (t) A(t))’ 1 = (<x 0 I + i cp(i) <p T (i) r 1 

Wo 



(2.19) 



that P(t) 0 as t -» oo [Ref. 6: p. 21]. Therefore, the algorithm loses sensitivity as t 
increases, and later values of 0 may not be as accurate as earlier values, especially if 
our model changes with time. There are two possible remedies to this problem: the use 
of a "forgetting factor", or the block processing approach. 

The forgetting factor minimizes the error 




E A'-* (y(k) - ^(k)*,.,) 2 + ol 

k=0 




( 2 . 20 ) 



w'here 0<X< 1 is the forgetting factor. This has the effect of assigning a higher weight 
to more recent data. 
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The block processing approach, on the other hand, divides the time set into 
segments of equal and fixed length N as in Figure 2.2. At the end of each time 







•••'III 1 III 

kN (k+ 


1)N 



Figure 2.2 Time Line. 

block, we reset the covariance matrix P(t). Although it can be reset at any time, for 
convenience we choose the end of each block, and P(t) now becomes 



P(t) = 



1 



-1 



I 



Equation (2.17) 



t = kN-1 
otherwise 



( 2 . 21 ) 



Therefore, the beginning of each time block is treated as a new, initialized period. 

It has been shown [Ref. 2: p. 20] that an external input u(t), sufficiently rich in 
frequency (n sinusoids), together with blocks I k of sufficient length N provide a 
guarantee for an estimation 0 -► 0 where 0* represents actual parameters. The effect 
of various lengths of N will be investigated later. 

If we apply the considerations above to the general case, we can see that the 
parameter estimates at the end of each time block are given by 



0 T ((k + l)N-l) 
• « 




y((k + l)N-l) 


<* T (kN) 


^(k+l)N = 


y(kN) 


<7oI 




A 

^O^kN 



This will be the foundation with which we build our recursive parallel algorithm. 

Although the algorithm presented in this chapter assumes a single input, single 
ouput (SISO) plant, it can easily be extended to the multiple input-output (MI MO) 
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plant [Ref. 7: pp. 25-27], Additionally, we assume our plant to be causal. 
Furthermore, because the problem of order determination is necessarily complicated, it 
will be assumed that the order of the plant is always known to the designer. 
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III. SYSTOLIC ARRAY IMPLEMENTATION 



A. IMPLEMENTATION OF ALGORITHM 

We now turn our attention to the solution of (2.22) using systolic arrays. In 
particular, (2.22) is suited for parallel computation. As described in [Ref. 2: pp. 23 - 
25], the identification problem can be reconstructed into a set of linear equations as 

R k ®(k+l)N = Pkl 

with R k in upper triangular form. We see here that 0^ k + can be computed by using 
two processors in cascade: one to compute R k and P kl , and the second to compute 0 
from (3.1). Note that implicit matrix inversion is not necessary since R k is in upper 
triangular form. 

As discussed in the previous chapter, we initialize the array at the beginning of 
each time block to <T 0 I (left half) and <7 0 ^kN (right half). This has the effect of 
initializing the R k matrix in (3.1) to an upper triangular form. Then, at each discrete 
time n, an array of data <p(n) and y(n) will be passed to the systolic array. The task of 
the array is to "re-triangularize" the data at each clock pulse, so the matrix remains in 
the form prescribed by (3.1). It is then a simple matter to solve for 0( k + 1 ^‘ This 
technique is based on QR decomposition as the means to triangularize the data array. 

The value of <T 0 relates to the confidence we have in the initial estimates. The 
larger the value of c 0 , the lower our confidence. Equations (2.18) - (2.22) illustrate the 
role that <r Q plays. 

B. GIVENS ROTATION 

An attractive and easily implemented method of triangularization by QR 
decomposition is provided by the Givens rotations. The reason for this choice is the 
fact that the Givens rotation operates only on adjacent data, making it suitable for 
systolic array implementation. The object is to combine two adjacent rows in the 
matrix, forcing zeros in the appropriate positions so to obtain an upper triangular data 
matrix. Figure 3.1 shows an example of triangularization by successive Givens 
rotations. 
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The underlying idea for this scheme is that any two rows of elements a ; j .. a.^, 
a i + 1 1 '* a i+ 141 ma y b e considered as a sequence of vectors (x,y) in & 2 as in Figure 3.2. 




* Figure 3.2 Vectors in 0t 2 . 

If we rotate a vector (a. ., a. . , .) by an angle a so to make it parallel to the x axis, then 
the y value (a- + j ,•) becomes zero. This same rotation a is then applied to the 
remaining (x,y) vectors in the afTected rows (in this case, rows i, i+1) to ensure 
algebraic equality. Next, the sequence of rotations is repeated for the remaining pairs 
of rows (i.e. rows i+1, i+2; i+2, i + 3; ... ) in order to force zeros in the correct 

locations that leave an upper triangular matrix. 
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Recall that the data matrix is initialized to an upper triangular form (cr 0 I). Then, 
as we take samples from the signals in the plant, new values of <p(t) and y(t) are added 
to the matrix. By a sequence of Givens rotations, we can transform it into an upper 
triangular matrix. An example set of rotations is shown in Figure 3.3, where the 




operations performed by the systolic array at each clock pulse are illustrated. This is 
repeated until t = kN (i.e. at the end of the time block), at which time the parameters 
0 are solved for, the matrix is reinitialized, and the process is repeated. 

The basis of the Givens rotation is the matrix 



Q(p.q) 



Ij 0 0 

0 r(p,q) 0 
0 0 i 2 



(3.2) 



associated to each pair of indexes p,q e (l,n+ 1), with Ij and I 2 being identity matrices 
of dimension (q < -2) * (q-2) and (n+ 1-q) * (n+ 1-q) respectively, and r is a 2 * 2 matrix 
of the form 



c(p,q) s(p,q) 

•s(p,q) c(p,q) 



(3.3) 



The matrix Q has the property that it affects only two rows at a time, i.e. rows q and 
q-1. Application of the transformation Q(p,q) to any matrix A of appropriate 
dimension yields 
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Q(p.q)A 



XXX 
X 1 X 

x 0 x 

XXX 

+ 

p 



q 



provided c(p,q) and s(p,q) in (3.3) are such that 



(3.4) 



c(p.q) a q ., p + s(p,q) a qp = 1 
c (P-q) a qp “ s (P-q) a q -i. p " 0 



(3-5) 



In Figure 3.3, application of the Givens rotation Q(3,4)Q(2,3)Q(1,2) to the left 
matrix results in the rotated matrix shown on the right. We see that to perform Givens 
rotations, we must be able to calculate addition, subtraction, multiplication, division, 
and squares. Next, we will see how to perform the rotations, by using add and shift 
operations only. 

C. CORDIC TECHNIQUE 

The CORDIC (Coordinate Rotation Digital Computer) technique [Ref. 8: pp. 
162-165] is particularly attractive in fixed point arithmetic for generating different types 
of trigonometric functions. The principle involved is to rotate a vector (x,y) through 
an angle a by a series of "properly quantized" angular steps. The single rotations of 
the vector (x, y) are computed at each step by a combination of simple add, subtract, 
and shift operations. 

Consider again a vector in the plane as in Figure 3.2. The vector is represented 
by its x and y components. Rotating the vector through any angle 5 leaves us with 
new rotated coordinates 



x' = x cos 8 ± y sin 8 
y' = y cos 8 T x sin 6 

where the top sign refers a clockwise rotation. 
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In the CORDIC technique, the rotation is performed in a sequence of angular 
steps tOj, such that the sum of them approaches 8. The to/s can be positive or 
negative, so 

8 = to Q ± o)j ± ... ± w n (3.7) 

If we define y. = ±1, then we can express 8 as 

8 = f YjO). (3.8) 

l«0 

The choice of the angles (Oj is determined on the basis of computational convenience. 
In particular, choose 

(Oj = tan’ 1 (2*‘) i = 0,1,.... (3.9) 

and consider a single rotation by an angle 0)j. By applying (3.6) and dividing by cos 0)j 
we obtain 



= x. ± yjtantOj = x i+1 



COS 0). 



COS <0. 
1 



— - y-t =F x i tanw i = y i+1 



(3.10) 



The factors Xj’/cos co. and y.'/cos (Oj are the rotated components of the vector (x,y). 
Note that the new vector is not only rotated, but also scaled by a factor 1/cos 0)j. 

Now, combining (3.7) (or (3.8)), (3.9) and (3.10) we can define a recursion 



* i+ i - * y?‘ - x i + w'‘ - V (311) 

y i+1 - y r =f -y t - - k ,y' 

for the rotation by an angle yxOj. In binary arithmetic this can be implemented with 
simple shift and add operations. Figure 3.4 illustrates a typical system for the 
determination of Yj. 
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Figure 3.4 General Block Diagram of CORDIC Computation. 

We can easily compute values of (a, i = 0,1,2,... and the corresponding scaling 
factor kj as given in Table l. 

For purposes of this research, we want to rotate an initial vector until its y value 
is equal to zero. The algebraic sum of the sequence of predetermined rotations will 
give us the angle 5 that the vector was rotated through. These rotations can be 
encoded into a ( binary sequence y that identifies the rotations performed. This 
sequence of rotations can then be passed to the remaining vectors in the affected rows. 
Figure 3.5 illustrates a series of rotations for an arbitrary vector which we desire to 
rotate by 30°. Note that in the figure, the desired angular value is approached quickly, 
because ex decreases by approximately half during each step. The y sequence in the 
figure would be ( + 1,~ 1,4- 1,— 1,4- 1,4- 1,~ 1,4* 1,— 1,4- 1,— 1). 

In an ideal case as just presented, the value of |y| decreases during each step. 
However, this may not always be the case. For example, consider the vector 
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TABLE 1 

THE BINARY CORDIC CONSTANTS 



i 


2*' 


to. 

i 

(degrees) 


k. 

i 


0 


1.0000000000 


45.00000000 


1.4142135 


1 


0.5000000000 


26.56505124 


1.581138826 


2 


0.2500000000 


14.03624340 


1.629800596 


3 


0.1250000000 


7.12501632 


1.642484060 


4 


0.0625000000 


3.57633432 


1.645688908 


5 


0.0312500000 


1.78991064 


1.646492240 


6 


0.0156250000 


0.89517384 


1.646639215 


7 


0.0078125000 


0.44761428 


1.646743467 


8 


0.0039062500 


0.22381056 


1.646756030 
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Figure 3.5 Set of CORDIC Rotations. 

(x,y) = (5,1) and o = 11.3°. The first rotation in the CORDIC algorithm (see Table 
1) would be -45°: this gives us a rotated vector (x,y) = (4.2, -2.S) and a = -33.7°. 
Note that the value of |y| has actually increased. To make the algorithm more 
efficient, we modify the sequence y to include the value zero. Whenever a rotation 
causes |y| to increase, we do not perform it, and we set the corresponding y. to zero. 
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Then we continue on with the next rotation value and repeat the process. For the 
example just cited, the next value to be tried would be 26.6°. 

A simple algorithm to perform the CORD1C rotations can be seen in Figure 3.6. 



INITIALIZE: ifx > 0 then x 0 = x, y Q = y 

else x Q = -x, y Q = -y 

if y > 0 then y 0 = + 1 
else y 0 = — 1 

for i : = 0 to N — 1 do (* number of iterations *) 

if — 7i^*' x i > 3"i t i len (* lyl increases *) 

x m := x i 
>'m := >i 

Vi 0 

else 

x i+i := x i + y\ r> y\ 
y i + i := >'i - r i 2' i x i 

if v. > 0 then y j + { : = +1 
else y i+1 := - 1 

end for; 
end. 



Figure 3.6 Algorithm for CORDIC Rotations. 



D. SYSTOLIC ARRAYS 
1. General 

We now examine the parallel structure that will be used to solve the least 
squares algorithm described above. We desire a structure that accepts a sequence of 
regression vectors tp(n) and signal y(n) as input and then outputs the estimate for 0. 
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Specifically, we are interested in a high performance parallel strucuture that can be 
implemented directly as a hardware device in order to deliver maximum throughput. 
Systolic arrays represent a structure suitable for these characteristics. 

The key to inexpensive implementation is simple and regular interconnections. 
Additionally, we want to allow the computations to proceed concurrently with the 
input, in order to maximize the throughput. This is known as pipelining. [Ref. 3: p. 
257] 



<P 2 ( t) 0 (t) 0 n (t) y (t) 




Figure 3.7 Systolic Array. 

A systolic array meets these requirements. Figure 3.7 shows a typical system. 
As noted earlier, the array is simply a network of processors that are regularly 
connected. The data is continuously "pumped" through this structure, thereby 
minimizing overall execution time, since all the processors work in parallel. 

It was shown that two processors would be necessary to solve (3.1). Previously- 
used configurations have consisted of a triangular array (as in Figure 3.7) to compute 
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the upper triangular matrix, and a linear array to solve the system of equations. Figure 
3.8 shows the typical design for the case where d (dimension) = 4. This thesis 
discusses an alternative configuration in which the linear section is replaced by a 
second triangular section identical to the first one. This new design is discussed later in 
this chapter. Both designs use a single clock signal to control operations. We will now- 
review the structures of the triangular and linear sections before continuing on to 
discuss the alternative design. 

2. Triangular Array 

The triangular systolic array performs the rotations as described in the section 
on the CORD 1C technique. The processors work simultaneously at each clock pulse. 
The data regression vector <p(t) and output signal y(t) is input to the top of the array, 
and rotations are calculated at each clock cycle. 

The triangular array consists of two types of cells: edge (or boundary) cells 
and internal cells. The edge cells are represented by the circles in Figure 3.7 or 3.8; the 
boundary cells are the squares. Figure 3.9 defines the operations of these cells. The 
edge cells compute the rotation vector y, which consists of a sequence of ± 1 or 0 as 
discussed previously. This vector y is then passed to the internal cell on its right. The 
internal cell then rotates its (x,y) vector by the value specified by y. These operations 
are performed down the length of the affected rows. 

Each cell of the triangular array stores an element of the upper triangular 
matrix R(n) from Equation (3.1), and it is initialized to zero for internal cells and to 
<TqI for the edge cells. Then, each row of cells in the array is used to combine one row 
of the stored matrix with the data received from the above cells. As discussed 
previously, the array maintains its upper triangular form throughout the computations. 

A delay of one clock cycle per cell is incurred w-hen passing the rotation 
parameters along a row. Therefore, it is necessary to "skew" the input data as seen in 
Figure 3.8, so that the input data interacts properly with the previously stored 
triangular matrix. Because the cells are all operating simultaneously, the data in the 
system at any time t consists of values from (2n) different matrices. Figure 3.10 
demonstrates this for a system with n= 3. In the figure, we can see that at time (t + 5), 
there are also values present from the five previous matrices (i.e. t + 4, t + 3,...,t). In 
order to get all the cells in the array to a similar time state, the array would have to be 
clocked an additional 2n-l (five) cycles, feeding zeros as input where necessary. At the 
completion, all cells will be at the same time (t + 5). 
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Figure 3.8 Systolic Array Implementation for Parameter Estimation: Old Design. 
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is computed recursively 


from 


X i«*l : 


= x. ♦ 
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r +i 


if J i > « 
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if y. = 6 or 


the rotation 




1 would 


increase |y| 


1 


L -i 


if y. < • 

l 





INTERNAL CELL 
X 



(t) 




y' 

(t*D 
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z* is computed recursively from 

x i.i = x i + 7 i 2 'V i 

j 9 is computed recursively from 

7 i*i = 7 i -^i 2 x i 

(over the sequence ^ ) 



out 



7. 



in 



Figure 3.9 Definition of Cell Operations for Triangular Section. 

Note that at the same time the triangularization process is being carried out, 
the column vector P kl is also being computed by the right side column of internal cells 
using y(n) as its input. At the end of the triangularization period (N), we are ready to 
solve for 9^+^- The data in this triangular array is clocked out to the next array 
section that will compute the parameters. 

3. Linear Array 

The linear systolic array has been used in previous implementations to solve 
for the estimated parameters. The linear section consists of one boundary cell and 
(d-1) internal cells as seen in Figure 3.8. The operation of the cells as they compute 
the parameters are shown in Figure 3.11. Note that the cells are different from those 
in the triangular array, increasing to four the number of unique cells necessary in the 
combined system. 



31 




32 






Figure 3.11 Definition of Cell Operations for Linear Section. 

It is shown in [Ref. 2: p. 36] that the time required to solve for 0 using the 
linear array is equal to 2d. At the end of that period, the parameters are used as initial 
values for the triangular array, and the triangular section again commences operation. 

We now turn our attention to the replacement of the linear section by a 
second triangular system, and discuss the differences between the two designs. 

4. The Use of a Second Triangular Array as the Solution Section 

An alternative implementation can be obtained by solving for 0 as shown in 
Figure 3.12. In this implementation, at the end of the triangularization period, the 
data is passed from the first triangular section to the second triangular section in a 
reversed fashion. The second array performs the same type of operations as the first; 
therefore the cells are identical. 

The key to using another triangular section is that, by proper combinations of 
rows, we can force zeros into all elements of a given row but one, so that we can solve 
for each of the parameters. Also, the fact that orthogonal operations are used makes it 
more robust in the presence of numerical errors. 

To see how this works, consider an arbitrary set of equations in upper 
triangular form as in Equation (3.12). Now, x 3 is simply solved as b 3 ,'a 33 . To solve for 
x 2 , we can force a-> 3 to zero by a linear combination of rows two and three. We know 
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Figure 3.12 Systolic Array Implementation for Parameter Estimation: New Design. 
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(3.12) 



ll x l 


+ 


a 12 X 2 


+ a„x 3 


= b 


OXj 


+ 


a 22 X 2 


+ a J3 x 3 


= b 


Oxj 


-1- 


0x 2 


+ a 33 x 3 


= b 



this can be easily accomplished by Givens rotations. Then x 2 is found to be b 2 'a, 2 . 
Similarly, by row operations on rows l, 2, and 3 we can make a 12 = a 13 = 0 in row 
one. Again, solving for is a simple operation: bj/a^. These type of operations are 
exactly what the triangular array was designed to do. 

Figure 3.13 illustrates these operations in matrix format. Note that the array 
is initialized to all zeros here, whereas the first triangular section is initialized to <r o 0 kN 
and <r Q I . 
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Figure 3.13 Solution of System of Equations. 



To see how the system operates, recall the first triangular array at time N. 
Now we must feed the data down into the second triangular section in an appropriate 
manner so that we can solve for the parameters. The same delay (one clock cycle per 
cell) in propagation of data applies to this triangular section as it does in the first 
section. Hence, we must carefully choose when to sample the array in order to get the 
correct values with which to calculate the solution. 

Figure 3.14 shows the data flow for a simple system where d = 3. The input 
data is skewed as it was in the triangular section. It can be seen that the values a 33 , 
a„, and a u are available at times N+l, N + 4, and N + 7 respectively. Similarly, 
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values bj, b 2 , and bj are found at times N + 4, N + 6, and N+8. In general, for any 
size system n, the coefficients a- and outputs b. are available as seen in Table 2. Note 
from the figure that the coefficients are "picked off' from the edge cells at the 
appropriate times, while the outputs are found in the rightmost set of internal cells. 





TABLE 2 


TIMES OF AVAILABILITY OF DATA 


coefficient 


time available 


a nn 


N + 1 


a n-l ji-1 


N + 4 


a n-2ji-2 


N + 7 


b n 


N + (n + 1) 


b n-l 


N + (n + 3) 


b n-2 


N + (n + 5) 



This new design operates slower than the linear system seen in [Ref. 2: p. 36]. 
The time to solve for 0 in this design is equal to the time until bj appears, which is 
equal to (3n * 1). This compares to 2n (or 2d) in previous implementations; hence, n-1 
more clock cycles are required. 

On the other hand, simplification is gained in the manufacturing process since 
the number of unique cells is reduced. Additionally, use of the CORDIC rotation 
technique allows us to use simpler processors. The tradeoffs to be considered are 
simplicity (cost) versus speed. 
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IV. SYSTOLIC ARRAY MODELING 



A. GENERAL 

1. System Equations 

We now study a particular "unknown" system and the performance of the 
systolic array in identifying its parameters. We will simulate a system both with and 
without noise present. Additionally, we compare the results obtained by using floating 
point versus fixed point arithmetic. 

For purposes of the simulation, consider a plant with discrete time transfer 

function 



H(z) = 



1.3z 2 

(z-0.5) 3 



(4.1) 



which corresponds to the linear difference equation 

y(t) - 1.5y(t-l) + 0.75y(t-2) - 0.125y(t-3) = 1.3u(t-l) (4.2) 

In particular, let the input sequence be 

u(t) = sin(37tt/10) + sin(37tt/5) + sin(37tt/2) + sin(9rct/5) (4.3) 

The dimension of the parameter vector is four, defined as 0 = (-1.5, 0.75, -0.125, 1.3]. 
In the parameter estimation problem, these values are assumed to be unknown. The 
input is "sufficiently rich" in frequency (minimum of n sinusoids) to excite all modes of 
the system [Ref. 1: p. 74]. Results of the simulations are discussed in the following 
sections. 

2. Choice of Initial Values 

For the recursive algorithm, recall that we need to initialize the systolic array 
with a parameter estimate <r o 0(O) and The value of <y Q , which is related to the 
confidence in the initial estimate as discussed in Chapter III, is chosen to be one for all 
simulations. If some information about 0(0) is available, it should be used when 
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determining appropriate initial conditions. In the absence of any prior knowledge of 0. 
we choose 0 ( 0 ) = 0 . 

3. Choice of Block Length 

We have chosen four different block lengths during the simulation studies: 
N = 3, 5, 10, 15. It will be seen in the noiseless case that the parameters' exhibit the 
fastest rate of convergence when N = 5. However, when noise is present, there is a 
tradeoff: the system is more sensitive to disturbances when the block length is shorter. 
This is because a longer time block provides for more time averaging, thus attenuating 
the effects of noise. Therefore, with more samples available, disturbances have a lesser 
effect on the estimates. Hence, we must make an informed decision about what trait is 
most important in a specific application. 

4. Fixed Point versus Floating Point 

In the sections that follow, we also compare the performances obtained when 
using floating point processors or fixed point processors. Fixed point arithmetic 
operations are performed using finite registers and simple shift operations. Therefore, 
they are simpler to implement than floating point operations. Additionally, floating 
point operations in general require longer registers (exponent and mantissa) to 
represent numerical values, which might add complexity to the processor. Hence, the 
simplicity of fixed point arithmetic is desired. 

The problem to be solved in the fixed point case is how to convert the input 
data values, which we normally expect to be floating point, into fixed point values. 
The answer, of course, is to appropriately scale all numbers so that they stay within the 
limits of the fixed registers. This task would be assigned to the Analog/Digital (A/D) 
converter, and the scale factor used would be in large part dependent on the range of 
values of the discrete plant samples. 

B. SYSTEMS WITHOUT NOISE 

The system under consideration was first modeled in an ideal environment 
without noise to verify convergence of parameters. Figures 4.1 through 4.8 display the 
results of these simulations. In the figures, the estimated parameters (0j, 0 2 , 63 , 0 4 ) 
are plotted along the vertical axis, and the block number is plotted along the horizontal 
axis. Block number simply indicates the number of blocks that have been completed, 
where N identifies the block size. Specific results for the floating point and fixed point 
systems are discussed below. 
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Figure 4.1 Floating Point, Without Noise, N = 3. 
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Figure 4.2 Floating Point, Without Noise, N = 5. 
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Figure 4.3 Floating Point, Without Noise, N = 10. 
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BLOCK NO. 




Figure 4.4 Floating Point, Without Noise, N = 15. 
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1. Results Using Floating Point Arithmetic 

Figures 4.1 through 4.4 show the floating point results for block lengths of 
N = 3, 5, 10, and 15 respectively. In order to evaluate and compare the relative rates 
of convergence, we estimate (by visual inspection) in each case where the parameters 
appear to have converged to their correct values. These values are tabulated in 
Table 3. It can be seen that with N = 5, the least number of clock cycles were 
required. Therefore, in this particular case, we should choose a block length of five. 



TABLE 3 

TIME TO CONVERGENCE 
(NOISELESS CASE) 

BLOCK LENGTH BLOCK NO. 


TOTAL 


N 


AT CONVERGENCE 


CYCLES 


3 


16 


48 


5 


6 


30 


10 


4 


40 


15 


3.5 


52 



2. Results Using Fixed Point Arithmetic 

The results for the fixed point system are shown in Figures 4.5 through 4.8. 
Note that the parameters converge just as rapidly as they did in the floating point 
implementation. (The simulations were run on an IBM 3033 mainframe, in which 
there are 3 1 bits available for an integer number.) This indicates that the systolic array 
with fixed point processors performs as well as the floating point system. This is a 
distinct advantage when we consider the simplicity of fixed point processors as 
discussed previously. 

C. SYSTEM WITH NOISE 

The system was next modeled with noise present. The noise term v(t) is a 
sequence of independent random variables identically distributed with zero mean (as 
white noise). We use a variance of 0.5 for these noise random variables. The noise is 
added to both the incoming signal u(t) and measured output y(t), as would be expected 
in a real system. Figure 4.9 illustrates these signals both with and without noise. 
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Figure 4.5 Fixed Point, Without Noise, N = 3. 
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Figure 4.6 Fixed Point, Without Noise, N = 5. 
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Figure 4.7 Fixed Point, Without Noise, N = 10. 
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Figure 4.8 Fixed Point, Without Noise, N = 15. 
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Figure 4.9a Input and Output Signals Without Noise Present. 




Figure 4.9b Input and Output Signals With Noise Present. 
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1. Results Using Floating Point Arithmetic 

Figures 4.10 through 4.13 display the results for the four different block 
lengths. With N = 3, noise causes the parameters to vary considerably. When N = 5 
or 10, disturbances are less likely to affect the parameters. When block length is equal 
to 10, the parameters have converged reasonably well within three blocks (or a total of 
30 cycles). For N = 15, we see that the parameters are least affected by the noise, as 
expected. Here we find the relative convergence time is about three blocks, or 45 
cycles. 

2. Results Using Fixed Point Arithmetic 

The results for fixed point implementation are shown in Figures 4.14 through 
4.17. Again we note that they exhibit the same performance as in the floating point 
cases. Effects of block length are the same as noted previously. 
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Figure 4.10 Floating Point, With Noise, N = 3. 
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Figure 4.11 Floating Point, With Noise, N = 5. 



52 



o 

CM 



_ oi 




SH3I3HVHVd aaivmisa 



Figure 4.12 Floating Point, With Noise, N = 10. 
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Figure 4.13 Floating Point, With Noise, N = 15. 
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Figure 4.14 Fixed Point, With Noise, N = 3. 
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Figure 4.15 Fixed Point, With Noise, N = 5. 
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Figure 4.16 Fixed Point, With Noise, N = 10. 
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Figure 4.17 Fixed Point, With Noise, N = 15. 
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V. CONCLUSIONS 



The estimation of parameters using a parallel structure has been described in the 
preceding chapters. In this chapter, we discuss tradeoffs, advantages, and 
disadvantages of the various systems we have investigated. 

Several possible conditions have been simulated in order to investigate the 
behavior of the parallel algorithm. We saw that the parameters converged well under 
all conditions, even in the presence of noise. Use of block processing tends to average 
out the effects of noise, perhaps at the slight expense of convergence rate. 

A parallel algorithm suitable to hardware implementation has been presented in 
this research. The main contribution which distinguishes this algorithm from others 
available in the literature is the fact that we have replaced two different processors by 
two identical ones. With previous designs, a total of four different computing cells 
were required: two for the triangular section and two for the linear section. For the 
new design, we see that we need only two types of cells. These are the edge and 
boundary cells, whose operations were described in Chapter III. However, this new 
system also needs more total cells. Specifically, it can be shown that an additional 
l / zd(d+ 1) cells are necessary, where d = dimension of system. Even for a sixth order 
system (d= 6), this requires only 21 additional cells. In a VLSI scheme, additional cells 
are considered inexpensive. 

The second triangular section is somewhat slower than the linear section. We 
saw in Chapter III that (d-1) additional clock cycles were required to operate the 
second triangular section. 

The functions of the cells must also be considered. The use of the Givens 
rotation by Equation (3.2) requires the use of a processor which can perform squaring 
operations, obtain square roots, multiply, divide, and do simple add and subtracts. Use 
of the CORDIC technique greatly simplifies the operations, in the sense that rotations 
can be implemented by use of addition and shifts only. Note from Table 1 that after 
rotation, the vector has increased in magnitude by the amount listed in the column k ; . 
This is easily compensated for by performing a right shift (division by 2) every other 
rotation or so. 
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Another important consideration is the use of fixed point versus floating point 
arithmetic. We saw in the simulation that the simple fixed point processors perform 
equally as well as floating point processors. 

When the results of this thesis are considered, we see that a systolic array using 
the CORDIC technique presents an attractive means of implementing the parallel 
algorithm to identify unknown system parameters. This, in turn, leads to important 
applications in adaptive control systems and real time identification problems. 
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APPENDIX 

PROGRAM LISTING FOR SYSTOLIC ARRAY SIMULATION 



This program calculates the parameter estimates for a discrete time linear system. 
It is written in Waterloo Pascal. In accordance with Pascal syntax rules, the size of the 
arrays must be defined in the declarations. Hence, one must know the order of the 
system to be modeled before attempting to use this program. 



program systolic (input , output) ; 



const 

dimension = 4; plusone = 5; (* dimension is order of system *) 

gammacount = 30; (* anticipated max number of rotations *) 

timerstop = 15; (* used as index in main *) 



type 

gamma^vector = record 
pi: boolean; 
scale_factor : real; 

params: array ( .0. .gammacount. ) of integer; 
end; 

gamma_array = array (. 1 .. dimension, 1.. plusone.) of gamma_vector ; 
theta_array = array (. 1 .. dimension. ) of real; 
u_array = array ( .0. .dimension, 1.. plusone.) of real; 
a_array = array (. 1 .. dimension, 1.. plusone.) of real; 



var 

gamma: gamma_array; (* direction of rotations *) 

gamma2 : gamma_array; (* direction of rotations *) 

a.- a_array; 

a2 -. a_array; 

theta: theta_array; 

u: u_array; 

u2 : u_array; 

i, j ,k,block_length,count,index: integer: 
timer: integer; (* timer variable *) 
sigma: real; 
ch: char; 
h,m: text; 



Procedure initialize (var sigma: real; var block_length:integer) ; 
var 

i, j : integer; 
begin 

for i := 1 to dimension do 
for j := 1 to plusone do 
a2 ( . i , j . ) :~ 0; 

writeln; writeln(m); 

writeln ('Initialize the systolic array 1 ); 
writeln (m, 1 Initialize the systolic array 1 ); 
writeln ('Choose the value of sigma'); writeln: 
writeln (m, 'Choose the value of sigma'); writeln(m); 
readln (sigma): writeln (sigma); writeln(m, sigma) ; 
writeln; writeln(m) ; 

writeln ('Now enter the initial estimates of'); 
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writeln (m,'Now enter the initial estimates of 1 ); 
writeln ('the vector theta in order requested'); writeln; 
writeln (m,'the vector theta in order requested'); writeln(m); 
for i := 1 to dimension do 
begin 

writeln ('Theta 1 , i , ' = '); 
writeln (m, 'Theta ' , i , ' = '): 

readln (a( .i ? dimension + 1.)); writeln; writeln(m) ; 
writeln (a( . i, dimension + 1.)); writeln; 
writeln (m, a( . i, dimension + 1.)); writeln(m) ; 
end; 

writeln ('Enter the block length desired'); 
v/riteln (m, 'Enter the block length desired'); 
readln (block length) ; 
writeln (block_length:4) ; 
writeln (m,block_length :4) ; 
for i := 1 to dimension do 

for j := 1 to dimension do 
a(.i,j.) := 0; 
for i:= 1 to dimension do 
begin 

a(.i,i.) := sigma * 1; 

write (m,a( .i, dimension + l.):9:4); (*write initial values*) 
a( . i, dimension +1.) : = sigma * a( .i, dimension + 1.); 




^ ********************************************************************* ^ 
Procedure reinitialize (sigma: real); 



var 

i , j : integer; 
begin 

for i := 1 to dimension do 
for j := 1 to plusone do 
, a2(.i,j.) := 0; 

for l := 1 to dimension do 

a (.i, plusone.) := sigma * theta (.i.); 
for i := 1 to dimension do 

for j := 1 to dimension do 
a ( .i, j . ) := 0; 
for i := 1 to dimension do 
a( .i,i. ) := sigma * 1 ; 

end; 

^ **;*************;*:*** ***** ******************************* ************** ^ 



^ A*********************************************************** ********* ^ 

Procedure internal (x:real; var v,yout:real; var gammain, gammaout: 

gamma_vector) ; 

(* This procedure performs the rotation on the x-y pair, given the 

? amma vector which contains the number and directions in which 
o rotate. The new values of the x-y pair are passed out *) 



var 

i: integer; 
old_x, old_y : real; 



Procedure scale (var x,y*. real; magfactor: real) ; 
begin 

x:= x * magfactor; 
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y:= y * magfactor; 



end; 



Function two (isinteger) treal; 

(* This function calculates the exponentiation of the integer 2 to 
the power ( -i ) *) 



var 

ns integer; 
r? real; 

begin 

r := 1; 

if i = 0 then two : = 1 
else for n := 1 to i do 
r := r * (1/2); 
two : = r ; 

end; 



begin 

if gammain.pi then 

"begin x.- = -x; y:= -y; end; 

Old_X := X; old_y := y; 

i := 0; 

while (gammain.params( .i. ) <> 9) and (i < gammacount) do 
begin 

if gammain.params( . i. ) <> 0 then 
"begin 

x := old_x + gammain.params( . i . ) * two(i) * old_y; 
y := old_y - gammain.params( .i. ) * two(i) * old_x; 
old_x s= x; old_y s= y; 
end; 

i : = i + 1; 
end; 

scale (x,y, gammain.scale_factor) ; 
yout := y; y : = x; 
gammaout := gammain; 

end; 

(A********************************************************************) 



( ******* * ******* ********* * ******** ***** *** ******************* ********* ^ 
Procedure edge (xsreal; var ys real; var gamma: gamma_vector) ; 

(* This procedure builds the gamma vector based upon the values of 
the x-y pair. The gamma vector contains only the values -1, 0 ( 1, 
or 9. If -1, then the vector is to be rotated counterclockwise. 

If +1, then it must be rotated clockwise. The value is set to 
zero if the next rotation would cause the new absolute value of 
y to be greater than the previous absolute value of y. This way, 
we can prevent the rotation from taking place and cause the 
values to converge quickly. The value of 9 is placed into the 
gamma vector once a pre-determined lower limit of y is reached, 
and it signals that no more rotations are to take place. *) 

const 

low_limit = le-6; 

var 

i : integer; 

temp_x, temp_y: real; 
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Function two (isinteger) :real; 

(* This function calculates the exponentiation of the integer 
2 to the power ( -i ) *) 



var 

n: integer; 
r: real; 

begin 

r : = 1 • 

if i =0 then two := 1 
else for n := 1 to i do 
r s= r * (1/2); 
two := r; 

end; 



Procedure one_rotation (x,y: real; var temp_x, temp_y: real); 

(* This procedure is used to do a rotation on the x-y pair. 
The input values of x and y are not changed, but tne 
rotated values are passed out as temp_x and temp_y. *) 

var 

temp_gamma: integer; 
begin 

if y > low_limit then temp_gamma := 1 
else temp_gamma := -1; 
temp_x s= x + temp_gamma * two(i) * y; 
temp_y := y - temp_gamma * two(i) * x; 

end; 



begin 

i := 0; 
if x < 0 then begin 



(* initialize if x < 0 *) 



x := -x; y := -y; gamma. pi := true; end 
else gamma. pi := false; 
gamma. scale_f actor := 1; (* initialize scale factor *) 

while (abs(y) > low_limit) and (i < gammacount) do 
begin 

one rotation (x,y,temp x, temp y ) ; ( * check to see what rotated *) 

(* values of x,y will be *) 
while (abs(y) < abs(temp_y)) and (i < gammacount) do 



* repeat this loop until new *) 

* y < old y *j 



begin 

gamma .params (. i. ) := 0; 

I := i + 1; 

one_rotation (x,y,temp_x, temp_y) ; 
end; 

if y > low_limit then gamma. params (. i. ) := 1 (*do CW rotation * 
else if y < -low_limit then gamma. params( .i. ) := -1; (* CCW * 

i ;= i + 1; 



:= temp_x: y := temp_y ; , 

if y > low_limit then gamma. pa r ams ( .i.) := 1 
else if y < -low_limit then gamma. params ( .i. ) := -1; 
if i 3 1 then gamma. scale_factor := l/sqrt(2) (* update scaling *) 
else gamma. scale_f actor := gamma. scale_£ actor * 



(* update the rotated values*) 



end 



end; (* while *) 

gamma. params( .i. ) := 9; 
y ••= x * gamma. scale_factor; 



cos(arctan(two(i-l) ) ) ; 

end of rotations *) 

;* scale final values *) 
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begin (* main *) 

reset (h, 'data text al'); 
rewrite(m, 'outfile text al'); 
initialize (sigma, block_length) ; 
for timer := 1 to timerstop do 
begin 

for index := 1 to block_length do 
begin 

for j := 1 to dimension + 1 do 
read (h,u( .0 , j . ) ) ; 
readln (h); 

for k : = 2 to 2 * dimension + 1 do 
begin 

for i := 1 to dimension do 
begin 

j := k - i: 

if (i <= j) and (j <= dimension + 1) then 
begin 

if i = j then 

edge (u(.i-l,j.), a(.i,j.), gamma( . i , j . ) ) 
else 



end; 



end; 



end; 



end; 



internal (u( . i-1 , j . ) , , a( . i, j . ) , u(.i,j.), 
gamma ( . l , ■) - 1 . ) , gamma ( . l , j . ) ) ; 



C* if i < j *) 

(* for i = 1 to dimension *) 
(* for k *) 

(* for index *) 



(* solve system *) 



a2(.l,l.) := a( .dimension, dimension.); 
a2( 



shift bottom of *) 



* a into a2 



end. 



:( . 1 ,: . 

!( . 1 ,plusone . ) := a( .dimension, plusone.); 

theta (.dimension.) := a2( . 1 , plusone . ) / a2(.l,l.); 
count := dimension - 1; 
while count >= 1 do 
begin 

for i := dimension downto 2 do (* shift down a matrix *) 

for k := 1 to plusone do (* and feed into u *) 

. a( . j , k. ) := a ( . j - 1 , k . ) ; 
for j := 1 to dimension do 

u2(.0,j.) := a( .dimension, plusone - j.); 
u2( .0 , plusone . ) := a( .dimension, plusone.); 
for k := 2 to 2 * dimension + 1 do 
begin 

for i := 1 to dimension do 
begin 

j := k - i; 

if (i <= j) and (j <= dimension + 1) then 
begin 

if i = j then 

edge (u2 ( . i-1 , j . ) , a2(.i,j.), gamma2( .i, j . ) ) 
else 

internal (u2( .i-1 , j . ) , a2(.i,j.). u2(.i,i.), 

gamma2( .i,j-l . ) , gammaz( . l , j . ) ) ; 
end; (* if i < j *) 

end; (* for i = 1 to dimension *) 

end; (* for k *) 

theta (.count.) := a2 (.dimension - count + 1, plusone.) / 

a2 (.dimension - count + 1, dimension - count + 1.); 
count := count - 1; 
end; (* while *) 

for i := 1 to dimension do (* write out values *) 

write (m, theta (.i.):9:4); 

reinitialize (sigma); 
end; (* for timer *) 
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